Contents: Correlation matrix; Exploratory Factor Analysis vs. Confirmatory Factor Analysis; An Example with Categorical Data in R (psych package)
Factor analysis is a useful tool for investigating variable relationships for complex concepts such as socioeconomic status, dietary patterns, or psychological scales.
It allows researchers to investigate concepts that are not easily measured directly by collapsing a large number of variables into a few interpretable underlying factors.
In factor analysis, a factor is a latent (unmeasured) variable that expresses itself through its relationship with other, measured variables.
Take, for example, a construct like cognition. We may want to measure a person’s cognition, but this is the kind of construct that is impossible to capture with a single variable: it is too abstract and multifaceted, even though it represents a single concept. Instead, you may have to develop a scale with many items, each of which measures some more concrete facet of cognition. The idea is that there is an underlying unmeasurable factor, cognition, that causes people to respond in characteristic patterns across the many items on the scale.
The purpose of factor analysis is to analyze these patterns of response as a way of getting at the underlying factor. Factor analysis also lets you combine the weighted item responses into what are called factor scores: a single score for each person on the factor. Factor scores are convenient because they let you use a single variable as a measure of the factor in other analyses, rather than a whole set of items.
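In its simplest form, the common factor model behind this idea writes each observed item as a weighted sum of the latent factors plus a unique error:
\[ x_i = \lambda_{i1} f_1 + \lambda_{i2} f_2 + \cdots + \lambda_{im} f_m + \varepsilon_i, \]
where the \(\lambda\)'s are the factor loadings estimated below and \(\varepsilon_i\) is the part of item \(i\) not shared with the factors.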
Factor analysis is most often applied to continuous variables, but nominal or ordinal variables pose no problem when using fa() in the psych library.
\(\Rightarrow\) As demonstrated below, using ordinal or nominal/binary data for factor analysis in R is no more difficult than using continuous data.
If the data include a mixture of variable types, use cor="mixed" in fa() from the psych library, as sketched below.
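A hypothetical sketch (the data frame df.mixed and its columns are invented for illustration; fa() builds the appropriate mixed correlation matrix internally):
library(psych)
# df.mixed: imagined data frame with binary, ordinal, and continuous columns
fa.mixed <- fa(df.mixed, nfactors = 2, cor = "mixed")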
Now we jump into an example: the depression data (NHANES, 2014-15) include 10 ordinal variables, listed below. Working through them helps us see how meaningful FA can be with categorical data.
- DPQ010: Have little interest in doing things
- DPQ020*: Feeling down, depressed, or hopeless
- DPQ030: Trouble sleeping or sleeping too much
- DPQ040: Feeling tired or having little energy
- DPQ050: Poor appetite or overeating
- DPQ060*: Feeling bad about yourself
- DPQ070: Trouble concentrating on things
- DPQ080: Moving or speaking slowly or too fast
- DPQ090*: Thought you would be better off dead
- DPQ100: Difficulty these problems have caused

(* marks the more critical/serious items. The dataset also contains a few more variables on different ordinal scales, such as SLD010H, SLQ050, and SLQ060, which are not used in this example.)
Each variable is coded in the same "ordinal" numeric format: the higher the value \(\Rightarrow\) the more the response suggests depression.
options(width = 300)
<- read.csv("data/dep.csv",header = TRUE,
dep sep = ",", na.strings = "NA")
# na.strings : a character vector of strings which are to be
# interpreted as NA values.
head(dep)
SEQN DPQ010 DPQ020 DPQ030 DPQ040 DPQ050 DPQ060 DPQ070 DPQ080 DPQ090 DPQ100 SLD010H SLQ050 SLQ060
1 73557 1 0 0 0 0 0 0 0 0 1 7 1 2
2 73558 2 0 0 0 0 0 0 0 0 0 9 2 2
3 73561 2 1 0 3 3 0 0 0 0 1 9 2 2
4 73562 3 3 3 3 3 1 2 1 0 3 5 2 1
5 73564 0 1 0 1 0 0 0 0 0 0 9 2 1
6 73566 0 0 1 0 0 0 0 0 0 0 6 1 2
nrow(dep)
[1] 3651
x <- dep[,2:11]
We first check our correlation matrix. A visual analysis of the correlation matrix can reveal similarities or correlations among groups of questions.
R <- cor(x)
library(corrplot)
corrplot(R,type="upper",order="hclust")
The blue structures are positively correlated groups of variables, and red structures would represent negatively correlated groups emerging in the data.
library(psych)
library(GPArotation)
head(x)
DPQ010 DPQ020 DPQ030 DPQ040 DPQ050 DPQ060 DPQ070 DPQ080 DPQ090
1 1 0 0 0 0 0 0 0 0
2 2 0 0 0 0 0 0 0 0
3 2 1 0 3 3 0 0 0 0
4 3 3 3 3 3 1 2 1 0
5 0 1 0 1 0 0 0 0 0
6 0 0 1 0 0 0 0 0 0
DPQ100
1 1
2 0
3 1
4 3
5 0
6 0
An important note for FA of different variable types using the psych library: fa(data, nfactors) has an additional argument, cor, with several options depending on the input variable types:

- cor="cor": Pearson correlation, for numeric variables on different scales
- cor="cov": covariance, for numeric variables on similar scales
- cor="tet": tetrachoric correlation, for dichotomous variables
- cor="poly": polychoric correlation, for ordinal variables
- cor="mixed": mixed correlations for a mixture of tetrachorics, polychorics, Pearsons, biserials, and polyserials, as appropriate (Yuleb is Yule-Bonett; Yuleq and YuleY are the usual Yule coefficients)

Remarks:

- You may convert nominal variables into dummy variables (see the sketch below), but that increases dimensionality very quickly.
- The polychoric/tetrachoric approach assumes each ordinal variable reflects a latent normal variable.
- Correlation between non-numeric variables is often assessed with chi-square test statistics instead.
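For instance, a minimal sketch of dummy coding a nominal variable (the variable marital here is hypothetical):
marital <- factor(c("single", "married", "widowed", "married"))
dummies <- model.matrix(~ marital - 1)   # one indicator column per category
head(dummies)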
Since our variables are all ordinal, let's use polychoric correlations:
polychoric(x)
Call: polychoric(x = x)
Polychoric correlations
DPQ01 DPQ02 DPQ03 DPQ04 DPQ05 DPQ06 DPQ07 DPQ08 DPQ09 DPQ10
DPQ010 1.00
DPQ020 0.58 1.00
DPQ030 0.31 0.33 1.00
DPQ040 0.38 0.38 0.37 1.00
DPQ050 0.38 0.43 0.34 0.34 1.00
DPQ060 0.47 0.70 0.30 0.36 0.43 1.00
DPQ070 0.47 0.51 0.34 0.35 0.41 0.51 1.00
DPQ080 0.48 0.51 0.41 0.37 0.41 0.51 0.59 1.00
DPQ090 0.40 0.65 0.34 0.37 0.37 0.70 0.49 0.49 1.00
DPQ100 0.57 0.66 0.42 0.47 0.46 0.61 0.56 0.59 0.61 1.00
with tau of
1 2 3
DPQ010 0.296 1.06 1.5
DPQ020 0.368 1.22 1.6
DPQ030 -0.096 0.74 1.1
DPQ040 -0.665 0.68 1.1
DPQ050 0.333 1.07 1.5
DPQ060 0.660 1.36 1.7
DPQ070 0.635 1.26 1.6
DPQ080 0.980 1.52 1.9
DPQ090 1.646 2.12 2.4
DPQ100 0.629 1.60 2.1
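The tau values are the estimated thresholds of the latent standard normal variable assumed to underlie each item: category \(k\) is observed when the latent score falls between \(\tau_{k-1}\) and \(\tau_k\), so \(P(X = k) = \Phi(\tau_k) - \Phi(\tau_{k-1})\), with \(\tau_0 = -\infty\) and the last threshold's upper bound at \(+\infty\). For example, DPQ090's first threshold of 1.646 implies that roughly \(\Phi(1.646) \approx 95\%\) of respondents fall in the lowest category, consistent with it being one of the most serious items.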
The output contains the polychoric correlation matrix (rho) and these thresholds (tau). We now run fa() on the ordinal variables (which have more than two values, hence cor='poly' rather than 'tet'), in this case assuming two factors:
fa.out <- fa(dep[,2:11], nfactors = 2, cor = 'poly', rotate = "oblimin")
fa.out
Factor Analysis using method = minres
Call: fa(r = dep[, 2:11], nfactors = 2, rotate = "oblimin", cor = "poly")
Standardized loadings (pattern matrix) based upon correlation matrix
MR2 MR1 h2 u2 com
DPQ010 0.54 0.15 0.44 0.56 1.1
DPQ020 0.19 0.67 0.69 0.31 1.2
DPQ030 0.69 -0.18 0.32 0.68 1.1
DPQ040 0.61 -0.06 0.32 0.68 1.0
DPQ050 0.54 0.05 0.34 0.66 1.0
DPQ060 -0.05 0.90 0.75 0.25 1.0
DPQ070 0.58 0.14 0.49 0.51 1.1
DPQ080 0.70 0.05 0.54 0.46 1.0
DPQ090 0.07 0.74 0.63 0.37 1.0
DPQ100 0.60 0.26 0.68 0.32 1.4
MR2 MR1
SS loadings 2.95 2.24
Proportion Var 0.29 0.22
Cumulative Var 0.29 0.52
Proportion Explained 0.57 0.43
Cumulative Proportion 0.57 1.00
With factor correlations of
MR2 MR1
MR2 1.00 0.78
MR1 0.78 1.00
Mean item complexity = 1.1
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 45 and the objective function was 4.72 with Chi Square of 17223.44
The degrees of freedom for the model are 26 and the objective function was 0.14
The root mean square of the residuals (RMSR) is 0.03
The df corrected root mean square of the residuals is 0.04
The harmonic number of observations is 3651 with the empirical chi square 246.89 with prob < 7.1e-38
The total number of observations was 3651 with Likelihood Chi Square = 524.04 with prob < 3.7e-94
Tucker Lewis Index of factoring reliability = 0.95
RMSEA index = 0.072 and the 90 % confidence intervals are 0.067 0.078
BIC = 310.76
Fit based upon off diagonal values = 1
Measures of factor score adequacy
MR2 MR1
Correlation of (regression) scores with factors 0.93 0.94
Multiple R square of scores with factors 0.87 0.89
Minimum correlation of possible factor scores 0.74 0.78
names(fa.out)
[1] "residual" "dof" "chi" "nh"
[5] "rms" "EPVAL" "crms" "EBIC"
[9] "ESABIC" "fit" "fit.off" "sd"
[13] "factors" "complexity" "n.obs" "objective"
[17] "criteria" "STATISTIC" "PVAL" "Call"
[21] "null.model" "null.dof" "null.chisq" "TLI"
[25] "RMSEA" "BIC" "SABIC" "r.scores"
[29] "R2" "valid" "score.cor" "weights"
[33] "rotation" "communality" "communalities" "uniquenesses"
[37] "values" "e.values" "loadings" "model"
[41] "fm" "rot.mat" "Phi" "Structure"
[45] "method" "scores" "R2.scores" "r"
[49] "np.obs" "fn" "Vaccounted"
plot(fa.out$values,type='b')
biplot(fa.out)
head(fa.out$scores)
MR2 MR1
[1,] -0.3143623 -0.3168916
[2,] -0.5140691 -0.4382790
[3,] 0.7452267 0.2120120
[4,] 3.2561279 2.2674301
[5,] -0.6919326 -0.3763604
[6,] -0.7741730 -0.6857362
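Under the hood, the default regression-method factor scores are (approximately) the standardized item responses multiplied by the score weights stored in fa.out$weights. A minimal sketch, assuming complete cases and treating the ordinal items as numeric:
Z <- scale(x)                  # standardize the 10 items
head(Z %*% fa.out$weights)     # approximately reproduces fa.out$scores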
Let’s check factor scores of subject 122, who is really depressed!
x[122,]
DPQ010 DPQ020 DPQ030 DPQ040 DPQ050 DPQ060 DPQ070 DPQ080 DPQ090
122 3 3 3 2 2 3 2 0 3
DPQ100
122 3
fa.out$scores[122,]
MR2 MR1
5.193272 6.472891
These are really large values; look at the second score especially.
Let's get the predicted values of x[122,]; they should be high.
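The reconstruction below uses the approximation \(\hat{x} \approx \bar{x} + \hat{f}^\top \Lambda^\top\), where \(\bar{x}\) is the vector of item means, \(\hat{f}\) the subject's factor scores, and \(\Lambda\) the 10-by-2 loading matrix. Note this treats the ordinal items as numeric, so it is only indicative.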
apply(x,2,mean) + fa.out$scores[122,]%*%t(as.matrix(fa.out$loadings[,1:2]))
DPQ010 DPQ020 DPQ030 DPQ040 DPQ050 DPQ060 DPQ070
[1,] 4.350776 5.852346 3.323438 3.933419 3.722472 6.000532 4.341527
DPQ080 DPQ090 DPQ100
[1,] 4.179323 5.194817 5.158928
Although the second factor is not supported by the elbow rule in the scree plot, we still include it in the analysis because of its substantive meaning.
Interpret the output as you would for numeric variables, but take care when generalizing FA results for non-numeric variables.
Interpretation:
An alternative way of doing the same FA is to pass the correlation matrix directly, which gives less output (no factor scores):
poly.R <- polychoric(x)$rho
fa.out <- fa(r = poly.R, nfactors = 2, n.obs = nrow(dep[,2:11]))
fa.diagram(fa.out, digits = 2)
The relationship of each variable to the underlying factor is expressed by the so-called factor loading. Let’s go through each part of the printed output.
round(cbind(fa.out$loadings,"h2"=fa.out$communality,"u2"=fa.out$uniquenesses,"com"=fa.out$complexity),2)
MR2 MR1 h2 u2 com
DPQ010 0.54 0.15 0.44 0.56 1.15
DPQ020 0.19 0.67 0.69 0.31 1.17
DPQ030 0.69 -0.18 0.32 0.68 1.14
DPQ040 0.61 -0.06 0.32 0.68 1.02
DPQ050 0.54 0.05 0.34 0.66 1.02
DPQ060 -0.05 0.90 0.75 0.25 1.01
DPQ070 0.58 0.14 0.49 0.51 1.11
DPQ080 0.70 0.05 0.54 0.46 1.01
DPQ090 0.07 0.74 0.63 0.37 1.02
DPQ100 0.60 0.26 0.68 0.32 1.37
What’s MR, ML, PC, etc.? These are the factors, and the name merely reflects the fitting method, e.g., minimum residual, maximum likelihood, principal components. The default is minimum residual, hence MR in this case.
Why are they ‘out of order’? The numbers assigned are arbitrary, and this has to do with the rotated solution (see the help file for more details); otherwise factors are numbered in order of variance accounted for.
h2: the amount of variance in the item/variable explained by the (retained) factors. It is the sum of the squared loadings, a.k.a. communality.
u2: 1 - h2. residual variance, a.k.a. uniqueness
com: item complexity. Specifically, it is “Hofmann’s index of complexity for each item,” computed as \((\sum_i \lambda_i^2)^2/\sum_i \lambda_i^4\), where \(\lambda_i\) is the factor loading on the \(i^{th}\) factor. Basically it tells you how much an item reflects a single construct: it equals 1 for an item loading on only one factor and grows as the item cross-loads (recomputed by hand below).
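To check this concretely, Hofmann's index can be recomputed from the loading matrix using the formula above:
L <- unclass(fa.out$loadings)
rowSums(L^2)^2 / rowSums(L^4)   # matches the 'com' column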
fa.out$Vaccounted
MR2 MR1
SS loadings 2.9498769 2.2362545
Proportion Var 0.2949877 0.2236254
Cumulative Var 0.2949877 0.5186131
Proportion Explained 0.5688010 0.4311990
Cumulative Proportion 0.5688010 1.0000000
The variance accounted for portion of the output (fa.out$Vaccounted) can be read as follows: SS loadings are the sums of squared loadings attributable to each factor; Proportion Var divides these by the number of items; Cumulative Var is the running total; and Proportion Explained is each factor's share of the total variance explained (see the sketch below for how the SS loadings can be reproduced).
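A minimal sketch, assuming psych's convention of adjusting for factor correlations in oblique solutions:
L   <- unclass(fa.out$loadings)   # pattern matrix
Phi <- fa.out$Phi                 # factor correlations
diag(Phi %*% t(L) %*% L)          # SS loadings for the oblique solution
colSums(L^2)                      # what they would be if factors were orthogonal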
The variable with the strongest association to the underlying latent variable Factor 1 is DPQ060 (Feeling bad about yourself), with a factor loading of 0.90.
Since factor loadings can be interpreted like standardized regression coefficients, one could also say that the variable DPQ060 has a correlation of 0.90 with Factor 1. This would be considered a strong association for a factor analysis in most research fields.
Two other variables, DPQ090 (Thought you would be better off dead) and DPQ020 (Feeling down, depressed, or hopeless), are also associated with Factor 1. Based on the variables loading highly onto Factor 1, we could call it “Low self-esteem individual awareness.”
DPQ010, DPQ030, DPQ040, DPQ050, DPQ070, DPQ080, and DPQ100, however, have high factor loadings on Factor 2. They seem to indicate observable signs of depression, so we may want to call Factor 2 “Physical depression signs.”
fa.out$Phi
MR2 MR1
MR2 1.0000000 0.7835837
MR1 0.7835837 1.0000000
Factor correlations (fa.out$Phi): whether you get this part of the output depends on whether the correlations are estimated at all. You need multiple factors and a rotation that allows the factors to correlate.
Model test results and fit indices (already discussed in the SEM material).
A feature of EFA is that the axes of the factors can be rotated within the multidimensional variable space. Consider what a factor analysis program does when determining the best fit between the variables and the latent factors. We have 10 variables that go into the factor analysis. The program looks first for the strongest correlations between variables and a latent factor, and makes that Factor 1; visually, one can think of it as an axis (Axis 1). The program then looks for the second set of correlations and calls it Factor 2, and so on. Sometimes the initial solution leaves a variable strongly correlated with several factors, or with none of them.
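Mechanically, rotating a two-factor solution by an angle \(\theta\) (which is what factor.rotate() does below) replaces the loading matrix \(\Lambda\) by
\[ \Lambda^{*} = \Lambda \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \]
which leaves each item's communality \(h^2\) unchanged for an orthogonal rotation.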
fa.dem <- factor.rotate(fa.out, 0, plot = TRUE, xlim = c(-1,1), ylim = c(-1,1), title = "oblimin rotation")
Rotations that allow for correlation are called oblique rotations; rotations that assume the factors are uncorrelated are called orthogonal rotations. Our graph shows an orthogonal set of axes. We ran the example above using oblimin (oblique); below, we test varimax (orthogonal):
<- fa(dep[,2:11], nfactors=2, cor='poly', rotate = "varimax")
fa.out fa.out
Factor Analysis using method = minres
Call: fa(r = dep[, 2:11], nfactors = 2, rotate = "varimax", cor = "poly")
Standardized loadings (pattern matrix) based upon correlation matrix
MR2 MR1 h2 u2 com
DPQ010 0.40 0.53 0.44 0.56 1.9
DPQ020 0.71 0.43 0.69 0.31 1.6
DPQ030 0.17 0.54 0.32 0.68 1.2
DPQ040 0.24 0.52 0.32 0.68 1.4
DPQ050 0.31 0.49 0.34 0.66 1.7
DPQ060 0.81 0.31 0.75 0.25 1.3
DPQ070 0.41 0.56 0.49 0.51 1.8
DPQ080 0.38 0.63 0.54 0.46 1.6
DPQ090 0.71 0.34 0.63 0.37 1.4
DPQ100 0.53 0.63 0.68 0.32 1.9
MR2 MR1
SS loadings 2.60 2.58
Proportion Var 0.26 0.26
Cumulative Var 0.26 0.52
Proportion Explained 0.50 0.50
Cumulative Proportion 0.50 1.00
Mean item complexity = 1.6
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 45 and the objective function was 4.72 with Chi Square of 17223.44
The degrees of freedom for the model are 26 and the objective function was 0.14
The root mean square of the residuals (RMSR) is 0.03
The df corrected root mean square of the residuals is 0.04
The harmonic number of observations is 3651 with the empirical chi square 246.89 with prob < 7.1e-38
The total number of observations was 3651 with Likelihood Chi Square = 524.04 with prob < 3.7e-94
Tucker Lewis Index of factoring reliability = 0.95
RMSEA index = 0.072 and the 90 % confidence intervals are 0.067 0.078
BIC = 310.76
Fit based upon off diagonal values = 1
Measures of factor score adequacy
MR2 MR1
Correlation of (regression) scores with factors 0.87 0.82
Multiple R square of scores with factors 0.75 0.68
Minimum correlation of possible factor scores 0.51 0.35
Sometimes the orthogonal rotation does not work out, i.e., it fails to produce a simple structure in which each variable loads highly on only one factor. In that case we can rotate the axes manually:
op <- par(mfrow = c(1,2))
cluster.plot(fa.out, xlim = c(-1,1), ylim = c(-1,1), title = "Unrotated")
f2r <- factor.rotate(fa.out, -65, plot = TRUE, xlim = c(-1,1), ylim = c(-1,1), title = "rotated -65 degrees")
par(op)   # restore the plotting layout
| | EFA | CFA |
|---|---|---|
| **What gets analyzed** | Correlation matrix (of items = indicators): only correlations among observed item responses are used; only a standardized solution is provided, so the original means and variances of the observed item responses are irrelevant | Covariance matrix (of items = indicators): means, variances, and covariances of item responses are analyzed (item response means historically have been ignored); output includes unstandardized AND standardized solutions |
| **Interpretation** | Rotation: all items load on all factors (latent traits); the goal is to pick a rotation that gives the closest approximation to "simple structure"; no way of distinguishing which latent traits are being measured from the correlations alone | Defining interpretation: CFA is theory-driven, so any structure becomes a testable hypothesis; you specify the number of latent factors and their inter-correlations, which items load on which latent factors, and any additional relationships for method/other covariance; you just need a clue, you don't have to be right (misfit is informative) |
| **Model fit** | Eye-balls and opinion: how many factors? (scree plots, interpretability); which rotation?; which items load on each factor? (arbitrary cutoff of .3-.4ish); standard errors infrequently used | Inferential tests via ML: global (and local) model fit tests; standard errors (and significance tests) of item loadings, error variances, and error covariances (all parameters); ability to test the appropriateness of model constraints or model additions via tests for change in model fit |
| **Factor scores** | Don't ever use factor scores from an EFA: factor scores are indeterminate (especially due to rotation); inconsistency in how factor models are applied to data | Factor scores can be used, but only if necessary; the best option is to test relations among latent factors directly through SEM; factor scores are less indeterminate in CFA and could be used |
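Although this post stops at EFA, a minimal CFA sketch of the two-factor structure found above could look like the following in lavaan (lavaan is not used elsewhere in this example; the factor names are invented, and the item split follows the oblimin loadings):
library(lavaan)
dep.model <- '
  signs   =~ DPQ010 + DPQ030 + DPQ040 + DPQ050 + DPQ070 + DPQ080 + DPQ100
  selfneg =~ DPQ020 + DPQ060 + DPQ090
'
# declaring the items as ordered triggers a polychoric-based (WLSMV) estimator
cfa.fit <- cfa(dep.model, data = dep, ordered = names(dep)[2:11])
summary(cfa.fit, fit.measures = TRUE, standardized = TRUE)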